INN Hotels

The increasing number of cancellations calls for a machine-learning solution that can predict which bookings are likely to be canceled. INN Hotels Group operates a chain of hotels in Portugal; it is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you must analyze the data provided to find which factors strongly influence booking cancellations, build a model that can predict in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.

Data Dictionary

Load Libraries and Customize Displays

First Look

Observation:

Sanity Check

Observation:

2.6% of bookings are by repeat guests

First Look Observations:

Questions for EDA:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. What are the differences in room prices in different market segments?
  4. What are the differences in room type in different market segments?
  5. What percentage of bookings are canceled?
  6. What percentage of repeating guests cancel?
  7. How are all the variables correlated to booking status?
  8. Do special requirements affect booking cancellation?
  9. How has the average price of rooms changed over time?

Further Questions:

Exploratory Data Analysis

First let's prepare the columns for plotting

We get 37 errors because 2018 was not a leap year, so February 29, 2018 is not a valid date.

Let's make a new dataframe that excludes those invalid dates

Let's test the validity of the rest of the dates in this subset
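A sketch of how those invalid dates can be detected and excluded, assuming the dataset stores the arrival date in separate `arrival_year`, `arrival_month`, and `arrival_date` columns (a small hypothetical frame stands in for the real data):

```python
import pandas as pd

# Hypothetical sample mimicking the booking data's date columns
df = pd.DataFrame({
    "arrival_year":  [2018, 2018, 2017],
    "arrival_month": [2, 10, 6],
    "arrival_date":  [29, 15, 3],
})

# errors="coerce" turns invalid combinations (e.g., 2018-02-29) into NaT
parsed = pd.to_datetime(
    df.rename(columns={"arrival_year": "year",
                       "arrival_month": "month",
                       "arrival_date": "day"})[["year", "month", "day"]],
    errors="coerce",
)

# Keep only rows whose year/month/day combination is a real calendar date
invalid_mask = parsed.isna()
df_valid = df[~invalid_mask].assign(arrival_full_date=parsed[~invalid_mask])
```

Counting `invalid_mask` before dropping rows confirms how many bookings carry an impossible date.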

Observation:

Interpretation:

Observation:

Observation:

Observation:

The busiest months are October, June, and September. The slowest month is January.

Observation:

Observation:

Observation:

Observation:

Observation:

Observation:

Observation:

Observation:

Observation:

Observation:

Data Preprocessing and Feature Engineering

Second Round of EDA

It's a good idea to explore the data again after manipulations

Observations:

No further EDA is needed since we did not perform any transformations, outlier treatments, or missing-value treatments.

Split the Data into Target and Features and add Constant

Check for Multicollinearity

Observation:

Split the Data into Training and Test Sets
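A sketch of the split, assuming a 70/30 partition with stratification on the target (the notebook's actual ratio and random seed may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and a balanced 0/1 target
X = pd.DataFrame({"lead_time": range(100), "avg_price": range(100, 200)})
y = pd.Series([0, 1] * 50)

# stratify=y keeps the cancellation rate similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
```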

Building a Logistic Regression model

Observation:

Dropping the features with high p-values did not improve the model on any metric, so let's stick with the original feature set.

Observation:

The confusion matrix on the training set of the first regression model (rows are actual labels, columns are predicted labels):

True Positives (TP) (1,1): we correctly predicted cancellation, 21%

True Negatives (TN) (0,0): we correctly predicted no cancellation, 60%

False Positives (FP) (0,1): we incorrectly predicted cancellation (a "Type I error"), 7%

False Negatives (FN) (1,0): we incorrectly predicted no cancellation (a "Type II error"), 12%
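These cell shares can be reproduced by normalizing the confusion matrix over all samples; a sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = canceled)
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0, 1, 1])

# normalize="all" expresses each cell as a share of all bookings;
# rows are actual labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, normalize="all")
tn, fp, fn, tp = cm.ravel()
```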

Which error is more costly?

If the hotel predicts a booking will cancel (1) and they actually don't (0) (Type I error), the hotel might be understaffed and overbooked and customer service might suffer.

If the hotel predicts a booking will not cancel (0) and they actually do (1) (Type II error), the hotel loses revenue on the room and is subject to costly, last-minute changes.

Type II errors are riskier and more costly, so False Negatives need to be reduced.

Let's improve Recall Score by changing the threshold and see if we can find a balance of costs...
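Changing the threshold only changes how predicted probabilities are binarized; a sketch with hypothetical probabilities:

```python
import numpy as np

# Hypothetical predicted cancellation probabilities from the model
proba = np.array([0.10, 0.36, 0.50, 0.70, 0.33])

# Default rule: predict "cancel" when probability >= 0.5
pred_default = (proba >= 0.50).astype(int)

# A lowered threshold flags more bookings as likely cancellations,
# trading some precision for higher recall
pred_034 = (proba >= 0.34).astype(int)
```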

Observation:

By changing the threshold to 0.34, we have increased the True Positives and reduced the False Negatives

On the training set, True Positives went up from 21% to 25%, and False Negatives went down from 12% to 8%

Let's check the Recall score with the 0.34 threshold...

See below that training Recall went up from 65% to 77%

Can we do better than 77% Recall Score?

Let's use Precision-Recall curve and see if we can find a better threshold
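A sketch of computing the curve with scikit-learn and picking the threshold where precision and recall are closest, which is one common heuristic (the notebook may use a different rule):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7])

# precision and recall are evaluated at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_true, proba)

# Heuristic: the threshold where precision and recall are closest
best_idx = np.argmin(np.abs(precision[:-1] - recall[:-1]))
best_threshold = thresholds[best_idx]
```

Plotting `precision` and `recall` against `thresholds` shows the trade-off visually.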

The 0.44 threshold from the Precision-Recall curve performed worse on Recall and better on Precision.

Since no-shows hurt the hotel more than overbooking does, we will stick with the 0.34 threshold, which had the better Recall score.

Final Logistic Regression Model Summary

Building a Decision Tree Model
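A minimal sketch of fitting an unconstrained tree on synthetic data; without depth limits it usually fits the training set almost perfectly, which is why tuning comes next:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Cancellations driven mostly by the first feature in this toy data
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

# An unconstrained tree typically memorizes the training data
tree = DecisionTreeClassifier(random_state=1).fit(X, y)
train_recall = recall_score(y, tree.predict(X))
```

Near-perfect training recall with a much lower test recall is the overfitting signature that motivates the hyperparameter tuning below.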

Observation:

Let's look at the important features in this first decision tree model

Using GridSearch for Hyperparameter tuning of our tree model
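A sketch of the tuning step; the grid below is hypothetical, and the scoring metric is set to recall to match the earlier cost argument:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Hypothetical grid; the notebook's actual grid may differ
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

# scoring="recall" optimizes for catching cancellations (the costlier miss);
# cv=5 evaluates each candidate with 5-fold cross-validation
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
).fit(X, y)
```

`grid.best_params_` and `grid.best_estimator_` give the winning configuration and the refitted model.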

Observation:

Hyperparameter tuning with GridSearchCV increased the Decision Tree's test Recall score from 0.80 to 0.95.

The F1 score went down from 0.79 to 0.62, reflecting the drop in precision that comes with optimizing for recall.

Observation:

The important features have definitely changed in this hyperparameter-tuned tree model, so next we try post-pruning

Post Pruning
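Post-pruning in scikit-learn works through cost-complexity pruning: `cost_complexity_pruning_path` yields candidate `ccp_alpha` values, and one tree is fit per candidate. A sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# The pruning path lists the effective alphas at which subtrees collapse
base = DecisionTreeClassifier(random_state=1)
path = base.cost_complexity_pruning_path(X, y)

# Larger ccp_alpha prunes more aggressively; in practice each candidate
# is evaluated on validation recall and the best-scoring alpha is kept
trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y)
    for a in path.ccp_alphas
]
leaf_counts = [t.get_n_leaves() for t in trees]
```

The largest alpha collapses the tree to its root, so `leaf_counts` shrinks monotonically along the path.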

Now that we've made the best model using post-pruning, let's compare the Training and Test scores

Observation:

Let's visualize the important features of the post-pruned decision tree...
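A sketch of extracting and ranking `feature_importances_`, with a synthetic frame where one feature drives the target by construction:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# "lead_time" drives the target; "noise" is irrelevant by construction
X = pd.DataFrame({
    "lead_time": rng.normal(100, 40, 300),
    "noise": rng.normal(size=300),
})
y = (X["lead_time"] > 100).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# Impurity-based importances, sorted for plotting
importances = (
    pd.Series(tree.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
# importances.plot.barh() would render the bar chart in a notebook
```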

Conclusions and Recommendations

The best and most balanced model for predicting cancellations is the Post-pruned Decision Tree model.